Transform a text input into a matrix of word embeddings, using the word2vec algorithm with the continuous bag-of-words (CBOW) model.
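To make the CBOW idea concrete, here is a minimal sketch of how each word in a sentence is paired with its surrounding context words, which is the prediction task CBOW trains on. The helper name `cbow_pairs` and the fixed `window` value are illustrative assumptions, not part of gensim. A real implementation does more than this (for example, it combines the context vectors and trains against the target), so treat this only as a picture of the input/output pairing.

```python
def cbow_pairs(tokens, window=2):
    """Return (context_words, target_word) pairs for one tokenized sentence.

    Hypothetical helper for illustration only: it shows which words
    surround each target, not how the embeddings are trained.
    """
    pairs = []
    for i, target in enumerate(tokens):
        left = tokens[max(0, i - window):i]      # up to `window` words before
        right = tokens[i + 1:i + 1 + window]     # up to `window` words after
        pairs.append((left + right, target))
    return pairs


example = ["the", "cat", "sat", "on", "the", "mat"]
for context, target in cbow_pairs(example, window=2):
    print(context, "->", target)
```

Each printed line shows the context words on the left and the word to be predicted from them on the right.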
import pandas as pd
import plotly.express as px
from gensim.models import Word2Vec
from sklearn.decomposition import PCA
import warnings
warnings.filterwarnings('ignore')
# Read the whole input file; the context manager closes it afterwards
with open("input.txt", "r") as f:
    input_text = f.read()
This stage splits the text into sentences. Each sentence is then split into words, with all words in lowercase.
# Replace newlines with a space so words on adjacent lines are not glued together
preproc_text = input_text.replace('\n', " ")
preproc_text = preproc_text.replace(',',"")
preproc_text = preproc_text.lower()
preproc_text = preproc_text.split('.')
sentences = []
for sentence in preproc_text:
    # Split on spaces and drop the empty strings left by repeated spaces
    sentences.append(
        list(filter(None, sentence.split(' ')))
    )
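As a quick check of what this stage produces, the sketch below applies the same steps to a short made-up string. The function name `to_sentences` and the sample text are assumptions for illustration. It also differs from the cell above in two labeled ways: newlines become a space rather than being removed, and empty sentences (such as the one after the final period) are dropped.

```python
def to_sentences(text):
    """Lowercase the text, drop commas, split on '.', then split each sentence into words.

    Hypothetical helper: newlines are replaced with a space and empty
    sentences are skipped, which are assumptions of this sketch.
    """
    cleaned = text.replace("\n", " ").replace(",", "").lower()
    # str.split() with no argument splits on any whitespace and drops empty strings
    return [s.split() for s in cleaned.split(".") if s.split()]


sample = "The cat sat, quietly.\nThe dog barked."
print(to_sentences(sample))
# [['the', 'cat', 'sat', 'quietly'], ['the', 'dog', 'barked']]
```

The result is a list of token lists, which is the shape `Word2Vec` expects for its `sentences` argument.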
The word2vec algorithm is applied to the sentences. A sample from the resulting matrix is shown.
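# sg=0 selects the CBOW model (the gensim default); min_count=1 keeps every word
model = Word2Vec(sentences, min_count=1, sg=0)
# gensim 3.x API; in gensim 4.0+ use model.wv[model.wv.index_to_key] instead
X = model.wv[model.wv.vocab]
print(pd.DataFrame(X).head())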
0 1 2 3 4 5 6 \
0 0.000737 -0.003690 -0.004953 0.003770 -0.000865 -0.004579 -0.003184
1 0.004020 -0.003855 -0.004203 -0.002301 -0.002879 0.001773 -0.003745
2 0.003379 -0.001403 0.001870 0.004768 -0.001600 0.001841 -0.001205
3 -0.000772 0.004323 0.002763 -0.001727 0.004760 0.002191 0.002292
4 -0.003344 -0.002078 -0.005008 -0.002970 0.000375 0.002823 -0.001100
7 8 9 ... 90 91 92 93 \
0 -0.001963 0.002724 -0.000444 ... -0.001683 0.001264 -0.003918 -0.002678
1 -0.001699 -0.002148 -0.000379 ... 0.001898 0.004911 0.003572 0.002609
2 0.001978 -0.004985 -0.003824 ... -0.000123 0.000506 -0.001037 -0.003690
3 0.000629 -0.003037 0.000201 ... 0.003004 0.001614 0.003805 0.004840
4 0.002775 -0.000532 0.002395 ... -0.000029 0.002512 0.000998 0.003894
94 95 96 97 98 99
0 -0.003121 0.003800 -0.002218 0.002712 -0.002991 -0.004579
1 -0.004123 -0.000234 0.003999 0.000785 0.004780 -0.001694
2 -0.002990 -0.001056 0.004387 0.001186 -0.003893 -0.003539
3 -0.000439 0.000632 -0.004742 0.001435 -0.000616 -0.004235
4 -0.003075 -0.001502 0.002311 0.003611 -0.002026 0.002693
[5 rows x 100 columns]
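# Project the 100-dimensional vectors onto their first two principal components
pca = PCA(n_components=2)
result = pca.fit_transform(X)
# gensim 3.x API; in gensim 4.0+ use model.wv.index_to_key instead
words = list(model.wv.vocab)
# Scatter plot of the 2-D projection, labelling each point with its word
fig = px.scatter(x=result[:, 0], y=result[:, 1], text=words)
fig.update_traces(textposition='top center')
fig.show()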